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Abstract 

This empirical study is mainly devoted to comparing four tree-based boosting algorithms: mart,
abc-mart, robust logitboost, and abc-logitboost, for multi-class classification on a variety of publicly
available datasets. Some of those datasets have been thoroughly tested in prior studies using a broad
range of classification algorithms including SVM, neural nets, and deep learning.

In terms of the empirical classification errors, our experiment results demonstrate:

1. Abc-mart considerably improves mart.

2. Abc-logitboost considerably improves (robust) logitboost.

3. (Robust) logitboost considerably improves mart on most datasets.

4. Abc-logitboost considerably improves abc-mart on most datasets.

5. These four boosting algorithms (especially abc-logitboost) outperform SVM on many datasets.

6. Compared to the best deep learning methods, these four boosting algorithms (especially
abc-logitboost) are competitive.

1 Introduction

Boosting algorithms (e.g., [16], [5], [17], [6]) have become very successful in machine learning. In
this paper, we provide an empirical evaluation of four tree-based boosting algorithms for multi-class
classification: mart [6], abc-mart [11], robust logitboost [13], and abc-logitboost [12], on a wide range of
datasets.

Abc-boost [11], where "abc" stands for adaptive base class, is a recent idea for improving
multi-class classification. Both abc-mart [11] and abc-logitboost [12] are specific implementations of
abc-boost. Although the experiments in [11, 12] were reasonable, we consider a more thorough study
necessary. Most datasets used in [11, 12] are (very) small. While those datasets (e.g., pendigits,
zipcode) are still popular in machine learning research papers, they may be too small to be practically
meaningful. Nowadays, applications with millions of training samples are not uncommon, for
example, in search engines [14].

It would also be interesting to compare these four tree-based boosting algorithms with other popular
learning methods such as support vector machines (SVM) and deep learning. A recent study [9]
conducted a thorough empirical comparison of many learning algorithms including SVM, neural nets, and
deep learning. The authors of [9] maintain a Web site
(http://www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/DeepVsShallowComparisonICML2007)
from which one can download the datasets and compare the test mis-classification errors.

In this paper, we provide extensive experiment results using mart, abc-mart, robust logitboost, and
abc-logitboost on the datasets used in [9], plus other publicly available datasets. One interesting dataset
is the UCI Poker. By private communications with C.J. Lin (the author of LibSVM), we learn that SVM
achieved a classification accuracy of < 60% on this dataset. Interestingly, all four boosting algorithms
can easily achieve > 90% accuracies.

We try to make this paper self-contained by providing a detailed introduction to abc-mart, robust
logitboost, and abc-logitboost in the next section.



2 LogitBoost, Mart, Abc-mart, Robust LogitBoost, and Abc-LogitBoost 

We denote a training dataset by $\{y_i, \mathbf{x}_i\}_{i=1}^{N}$, where $N$ is the number of feature vectors (samples), $\mathbf{x}_i$ is
the $i$th feature vector, and $y_i \in \{0, 1, 2, ..., K-1\}$ is the $i$th class label, where $K \geq 3$ in multi-class
classification.

Both the logitboost [7] and mart (multiple additive regression trees) [6] algorithms can be viewed as
generalizations of logistic regression, which assumes class probabilities $p_{i,k}$ as

$$p_{i,k} = \Pr(y_i = k \mid \mathbf{x}_i) = \frac{e^{F_{i,k}(\mathbf{x}_i)}}{\sum_{s=0}^{K-1} e^{F_{i,s}(\mathbf{x}_i)}}. \qquad (1)$$

While traditional logistic regression assumes $F_{i,k}(\mathbf{x}_i) = \beta_k^{T}\mathbf{x}_i$, logitboost and mart adopt the flexible
"additive model," which is a function of $M$ terms:

$$F^{(M)}(\mathbf{x}) = \sum_{m=1}^{M} \rho_m h(\mathbf{x}; \mathbf{a}_m), \qquad (2)$$

where $h(\mathbf{x}; \mathbf{a}_m)$, the base learner, is typically a regression tree. The parameters, $\rho_m$ and $\mathbf{a}_m$, are learned
from the data by maximum likelihood, which is equivalent to minimizing the negative log-likelihood
loss

$$L = \sum_{i=1}^{N} L_i, \qquad L_i = -\sum_{k=0}^{K-1} r_{i,k} \log p_{i,k}, \qquad (3)$$

where $r_{i,k} = 1$ if $y_i = k$ and $r_{i,k} = 0$ otherwise.

For identifiability, $\sum_{k=0}^{K-1} F_{i,k} = 0$, i.e., the sum-to-zero constraint, is routinely adopted (e.g., [7, 6, 19]).
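The class probabilities (1) and the loss (3) can be illustrated with a few lines of Python. This is only a minimal sketch, not the authors' implementation: the function names and the toy values are assumptions made for illustration. It also shows why an identifiability constraint is needed, since adding the same constant to all $F_{i,k}$ of a sample leaves the probabilities unchanged.

```python
import math

def class_probabilities(F_i):
    """Class probabilities of Eq. (1) for one sample, given F_{i,0}, ..., F_{i,K-1}."""
    shift = max(F_i)                       # subtracting the max avoids overflow in exp
    exps = [math.exp(f - shift) for f in F_i]
    total = sum(exps)
    return [e / total for e in exps]

def negative_log_likelihood(F, y):
    """Loss of Eq. (3): r_{i,k} is 1 only for k = y_i, so L_i = -log p_{i,y_i}."""
    loss = 0.0
    for F_i, y_i in zip(F, y):
        loss -= math.log(class_probabilities(F_i)[y_i])
    return loss

# Toy example with K = 3 classes and N = 2 samples (made-up values).
F = [[0.0, 0.0, 0.0], [2.0, -1.0, -1.0]]
y = [1, 0]
print(negative_log_likelihood(F, y))

# Shifting all K values of a sample by a constant does not change the probabilities.
print(class_probabilities([2.0, -1.0, -1.0]))
print(class_probabilities([7.0, 4.0, 4.0]))
```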



2.1 Logitboost 

As described in Alg. 1, [7] builds the additive model (2) by a greedy stage-wise procedure, using a
second-order (diagonal) approximation, which requires knowing the first two derivatives of the loss
function (3) with respect to the function values $F_{i,k}$. [7] obtained:

$$\frac{\partial L_i}{\partial F_{i,k}} = -(r_{i,k} - p_{i,k}), \qquad \frac{\partial^2 L_i}{\partial F_{i,k}^2} = p_{i,k}(1 - p_{i,k}). \qquad (4)$$

Those derivatives can be derived by assuming no relations among $F_{i,k}$, $k = 0$ to $K-1$. However, [7]
used the "sum-to-zero" constraint $\sum_{k=0}^{K-1} F_{i,k} = 0$ throughout the paper and provided an alternative
explanation: [7] showed (4) by conditioning on a "base class" and noticed that the resultant derivatives are
independent of the choice of the base.
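The derivatives in (4) can be checked numerically when the $F_{i,k}$ are treated as unrelated. The sketch below is illustrative only: the helper names, the step size `h`, and the toy values are assumptions, and the finite differences merely approximate the analytic expressions.

```python
import math

def softmax(F_i):
    m = max(F_i)
    e = [math.exp(f - m) for f in F_i]
    total = sum(e)
    return [v / total for v in e]

def sample_loss(F_i, y_i):
    """L_i of Eq. (3) for a single sample."""
    return -math.log(softmax(F_i)[y_i])

def logitboost_derivatives(F_i, y_i, k):
    """First and second derivatives of L_i with respect to F_{i,k}, as in Eq. (4)."""
    p = softmax(F_i)
    r = 1.0 if y_i == k else 0.0
    return -(r - p[k]), p[k] * (1.0 - p[k])

def numeric_derivatives(F_i, y_i, k, h=1e-4):
    """Central finite differences, moving only F_{i,k} (no constraint imposed)."""
    up = list(F_i)
    up[k] += h
    dn = list(F_i)
    dn[k] -= h
    l0, lu, ld = sample_loss(F_i, y_i), sample_loss(up, y_i), sample_loss(dn, y_i)
    return (lu - ld) / (2 * h), (lu - 2 * l0 + ld) / (h * h)

# Made-up function values for one sample with K = 3 and label y_i = 2.
print(logitboost_derivatives([0.3, -0.5, 0.2], 2, 2))
print(numeric_derivatives([0.3, -0.5, 0.2], 2, 2))
```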

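Algorithm 1 LogitBoost [7, Alg. 6]. $\nu$ is the shrinkage.

0: $r_{i,k} = 1$ if $y_i = k$, $r_{i,k} = 0$ otherwise.
1: $F_{i,k} = 0$, $p_{i,k} = \frac{1}{K}$, $k = 0$ to $K-1$, $i = 1$ to $N$
2: For $m = 1$ to $M$ Do
3:   For $k = 0$ to $K-1$ Do
4:     Compute $w_{i,k} = p_{i,k}(1 - p_{i,k})$.
5:     Compute $z_{i,k} = \frac{r_{i,k} - p_{i,k}}{p_{i,k}(1 - p_{i,k})}$.
6:     Fit the function $f_{i,k}$ by a weighted least-square of $z_{i,k}$ to $\mathbf{x}_i$ with weights $w_{i,k}$.
7:     $F_{i,k} = F_{i,k} + \nu \frac{K-1}{K}\left(f_{i,k} - \frac{1}{K}\sum_{k=0}^{K-1} f_{i,k}\right)$
8:   End
9:   $p_{i,k} = \exp(F_{i,k}) / \sum_{s=0}^{K-1} \exp(F_{i,s})$
10: End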



At each stage, logitboost fits an individual regression function separately for each class. This is
analogous to the popular individualized regression approach in multinomial logistic regression, which
is known to result in a loss of statistical efficiency, compared to the full (conditional) maximum
likelihood approach.

On the other hand, in order to use trees as base learner, the diagonal approximation appears to be a 
must, at least from the practical perspective. 

2.2 Adaptive Base Class Boost (ABC-Boost) 

[11] derived the derivatives of the loss function (3) under the sum-to-zero constraint. Without loss of
generality, we can assume that class 0 is the base class. For any $k \neq 0$,

$$\frac{\partial L_i}{\partial F_{i,k}} = (r_{i,0} - p_{i,0}) - (r_{i,k} - p_{i,k}), \qquad \frac{\partial^2 L_i}{\partial F_{i,k}^2} = p_{i,0}(1 - p_{i,0}) + p_{i,k}(1 - p_{i,k}) + 2 p_{i,0} p_{i,k}. \qquad (5)$$

The base class must be identified at each boosting iteration during training. [11] suggested an exhaustive
procedure to adaptively find the best base class to minimize the training loss (3) at each iteration.

[11] combined the idea of abc-boost with mart. The algorithm, named abc-mart, achieved good
performance in multi-class classification on the datasets used in [11].
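The derivatives in (5) follow from the chain rule once the base-class value is tied to the others by the sum-to-zero constraint. The following sketch checks them with finite differences; the function names, the step size, and the toy values are assumptions for illustration, and the base class is passed as a parameter `b` rather than fixed to class 0.

```python
import math

def softmax(F_i):
    m = max(F_i)
    e = [math.exp(f - m) for f in F_i]
    total = sum(e)
    return [v / total for v in e]

def abc_derivatives(F_i, y_i, k, b):
    """Eq. (5), written for a general base class b instead of class 0."""
    p = softmax(F_i)
    r_k = 1.0 if y_i == k else 0.0
    r_b = 1.0 if y_i == b else 0.0
    grad = (r_b - p[b]) - (r_k - p[k])
    hess = p[b] * (1.0 - p[b]) + p[k] * (1.0 - p[k]) + 2.0 * p[b] * p[k]
    return grad, hess

def constrained_loss(F_i, y_i, k, b, delta):
    """L_i after adding delta to F_{i,k} and subtracting it from F_{i,b},
    so that the K values still sum to zero."""
    G = list(F_i)
    G[k] += delta
    G[b] -= delta
    return -math.log(softmax(G)[y_i])

# Made-up values that sum to zero; base class b = 0, derivative taken for k = 1.
F_i, y_i, k, b, h = [0.5, -0.2, -0.3], 0, 1, 0, 1e-4
grad, hess = abc_derivatives(F_i, y_i, k, b)
up, mid, dn = (constrained_loss(F_i, y_i, k, b, d) for d in (h, 0.0, -h))
print(grad, (up - dn) / (2 * h))            # analytic vs. numeric first derivative
print(hess, (up - 2 * mid + dn) / (h * h))  # analytic vs. numeric second derivative
```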

2.3 Robust LogitBoost 

The mart paper [6] and a recent (2008) discussion paper commented that logitboost (Alg. 1) can be
numerically unstable. In fact, the logitboost paper [7] suggested some "crucial implementation
protections" on page 17 of [7]:

• In Line 5 of Alg. 1, compute the response $z_{i,k}$ by $\frac{1}{p_{i,k}}$ (if $r_{i,k} = 1$) or $\frac{-1}{1 - p_{i,k}}$ (if $r_{i,k} = 0$).

• Bound the response $|z_{i,k}|$ by $z_{max} \in [2, 4]$. The value of $z_{max}$ is not sensitive as long as it is in [2, 4].

Note that the above operations were applied to each individual sample. The goal was to ensure that
the response $|z_{i,k}|$ would not be too large. On the other hand, we should hope to use larger $|z_{i,k}|$ to
better capture the data variation. Therefore, this thresholding operation occurs very frequently and it is
expected that part of the useful information is lost.
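The two protections above can be sketched as follows. This is an illustrative reading of the description, not code from [7]; the function name and the default `z_max = 4.0` are assumptions. Since $1/p > 4$ whenever $p < 0.25$, the bound is already active for a correctly labeled sample whose current probability is small.

```python
def logitboost_response(r, p, z_max=4.0):
    """Response z of Line 5 in Alg. 1 for one sample and one class,
    computed in the protected form and then bounded to [-z_max, z_max]."""
    z = 1.0 / p if r == 1 else -1.0 / (1.0 - p)
    return max(-z_max, min(z_max, z))

# With p = 0.5 the raw response is 2 and is left untouched.
print(logitboost_response(1, 0.5))
# With p = 0.1 and r = 1 the raw response is 10, which is cut down to z_max.
print(logitboost_response(1, 0.1))
# With p = 0.9 and r = 0 the raw response is about -10, which is cut to -z_max.
print(logitboost_response(0, 0.9))
```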

The next subsection explains that, if implemented carefully, logitboost is almost identical to mart. 
The only difference is the tree-splitting criterion. 

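2.4 Tree-Splitting Criterion Using Second-Order Information

Consider $N$ weights $w_i$ and $N$ response values $z_i$, $i = 1$ to $N$, which are assumed to be ordered
according to the sorted order of the corresponding feature values. The tree-splitting procedure is to find
the index $s$, $1 \leq s < N$, such that the weighted mean square error (MSE) is reduced the most if split at
$s$. That is, we seek the $s$ to maximize

$$Gain(s) = MSE_T - (MSE_L + MSE_R) = \sum_{i=1}^{N} (z_i - \bar{z})^2 w_i - \left[\sum_{i=1}^{s} (z_i - \bar{z}_L)^2 w_i + \sum_{i=s+1}^{N} (z_i - \bar{z}_R)^2 w_i\right],$$

where $\bar{z} = \frac{\sum_{i=1}^{N} z_i w_i}{\sum_{i=1}^{N} w_i}$, $\bar{z}_L = \frac{\sum_{i=1}^{s} z_i w_i}{\sum_{i=1}^{s} w_i}$, $\bar{z}_R = \frac{\sum_{i=s+1}^{N} z_i w_i}{\sum_{i=s+1}^{N} w_i}$. After simplification, one can obtain

$$Gain(s) = \frac{\left[\sum_{i=1}^{s} z_i w_i\right]^2}{\sum_{i=1}^{s} w_i} + \frac{\left[\sum_{i=s+1}^{N} z_i w_i\right]^2}{\sum_{i=s+1}^{N} w_i} - \frac{\left[\sum_{i=1}^{N} z_i w_i\right]^2}{\sum_{i=1}^{N} w_i}.$$

Plugging in $w_i = p_{i,k}(1 - p_{i,k})$ and $z_i = \frac{r_{i,k} - p_{i,k}}{p_{i,k}(1 - p_{i,k})}$ yields

$$Gain(s) = \frac{\left[\sum_{i=1}^{s} (r_{i,k} - p_{i,k})\right]^2}{\sum_{i=1}^{s} p_{i,k}(1 - p_{i,k})} + \frac{\left[\sum_{i=s+1}^{N} (r_{i,k} - p_{i,k})\right]^2}{\sum_{i=s+1}^{N} p_{i,k}(1 - p_{i,k})} - \frac{\left[\sum_{i=1}^{N} (r_{i,k} - p_{i,k})\right]^2}{\sum_{i=1}^{N} p_{i,k}(1 - p_{i,k})}.$$

Because the computations involve $\sum p_{i,k}(1 - p_{i,k})$ as a group, this procedure is actually numerically
stable.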
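The equivalence between the MSE form of the gain and the simplified form can be verified on a small example. The sketch below is only an illustration under assumed toy values; the function names are hypothetical. The second function never divides by an individual $p_{i,k}(1 - p_{i,k})$, only by their sum over a group, which is the point made above about numerical stability.

```python
def weighted_sse(z, w):
    """Weighted sum of squared deviations of z around its weighted mean."""
    mean = sum(zi * wi for zi, wi in zip(z, w)) / sum(w)
    return sum(wi * (zi - mean) ** 2 for zi, wi in zip(z, w))

def gain_direct(z, w, s):
    """Gain(s) = MSE_T - (MSE_L + MSE_R), splitting after the first s samples."""
    return weighted_sse(z, w) - (weighted_sse(z[:s], w[:s]) + weighted_sse(z[s:], w[s:]))

def gain_from_residuals(r, p, s):
    """The same gain written only with group sums of (r - p) and p(1 - p)."""
    def term(rs, ps):
        num = sum(ri - pi for ri, pi in zip(rs, ps))
        den = sum(pi * (1.0 - pi) for pi in ps)
        return num * num / den
    return term(r[:s], p[:s]) + term(r[s:], p[s:]) - term(r, p)

# Toy node with N = 6 samples, already sorted by the feature value (made-up numbers).
r = [1, 0, 1, 1, 0, 0]
p = [0.7, 0.2, 0.6, 0.9, 0.4, 0.1]
w = [pi * (1.0 - pi) for pi in p]
z = [(ri - pi) / wi for ri, pi, wi in zip(r, p, w)]

for s in range(1, len(r)):
    print(s, gain_direct(z, w, s), gain_from_residuals(r, p, s))
```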

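In comparison, mart [6] only used the first-order information to construct the trees, i.e.,

$$MartGain(s) = \frac{1}{s}\left[\sum_{i=1}^{s} (r_{i,k} - p_{i,k})\right]^2 + \frac{1}{N-s}\left[\sum_{i=s+1}^{N} (r_{i,k} - p_{i,k})\right]^2 - \frac{1}{N}\left[\sum_{i=1}^{N} (r_{i,k} - p_{i,k})\right]^2.$$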
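A first-order gain of this kind can be scanned over all split positions with a running sum. The sketch below assumes the unweighted form with factors $1/s$, $1/(N-s)$ and $1/N$ (the general gain of this subsection with all weights equal to one); the function names and residual values are made up for illustration.

```python
def mart_gain(residuals, s):
    """First-order gain for splitting after the first s samples (unit weights)."""
    n = len(residuals)
    left = sum(residuals[:s])
    right = sum(residuals[s:])
    total = left + right
    return left * left / s + right * right / (n - s) - total * total / n

def best_split(residuals):
    """Scan s = 1, ..., N-1 with a running sum, so the whole scan is O(N)."""
    n = len(residuals)
    total = sum(residuals)
    best_s, best_gain, left = None, float("-inf"), 0.0
    for s in range(1, n):
        left += residuals[s - 1]
        right = total - left
        gain = left * left / s + right * right / (n - s) - total * total / n
        if gain > best_gain:
            best_s, best_gain = s, gain
    return best_s, best_gain

# Residuals r - p of six samples sorted by one feature (made-up values).
residuals = [0.9, 0.8, 0.7, -0.1, -0.2, -0.1]
print(best_split(residuals))
```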



Alg. 2 describes robust logitboost using the tree-splitting criterion in Sec. 2.4. Note that after trees
are constructed, the values of the terminal nodes are computed by

$$\frac{\sum_{node} z_{i,k} w_{i,k}}{\sum_{node} w_{i,k}} = \frac{\sum_{node} (r_{i,k} - p_{i,k})}{\sum_{node} p_{i,k}(1 - p_{i,k})},$$

which explains Line 5 of Alg. 2.

Algorithm 2 Robust logitboost, which is very similar to mart, except for Line 4.

1: $F_{i,k} = 0$, $p_{i,k} = \frac{1}{K}$, $k = 0$ to $K-1$, $i = 1$ to $N$
2: For $m = 1$ to $M$ Do
3:   For $k = 0$ to $K-1$ Do
4:     $\{R_{j,k,m}\}_{j=1}^{J}$ = $J$-terminal node regression tree from $\{r_{i,k} - p_{i,k}, \mathbf{x}_i\}_{i=1}^{N}$,
       with weights $p_{i,k}(1 - p_{i,k})$ as in Sec. 2.4.
5:     $\beta_{j,k,m} = \frac{K-1}{K} \frac{\sum_{\mathbf{x}_i \in R_{j,k,m}} (r_{i,k} - p_{i,k})}{\sum_{\mathbf{x}_i \in R_{j,k,m}} (1 - p_{i,k}) p_{i,k}}$
6:     $F_{i,k} = F_{i,k} + \nu \sum_{j=1}^{J} \beta_{j,k,m} 1_{\mathbf{x}_i \in R_{j,k,m}}$
7:   End
8:   $p_{i,k} = \exp(F_{i,k}) / \sum_{s=0}^{K-1} \exp(F_{i,s})$
9: End



2.5 Adaptive Base Class Logitboost (ABC-LogitBoost) 

The abc-boost [11] algorithm consists of two key components:

1. Using the sum-to-zero constraint (e.g., [7, 6, 19, 10, 21, 20]) on the loss function, one can formulate
boosting algorithms only for $K - 1$ classes, by treating one class as the base class.

2. At each boosting iteration, adaptively select the base class according to the training loss. [11]
suggested an exhaustive search strategy.

[11] combined abc-boost with mart to develop abc-mart. More recently, [12] developed
abc-logitboost, the combination of abc-boost with (robust) logitboost.
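The second component, selecting the base class by the training loss, can be sketched in isolation. The code below assumes that the candidate function values for each trial base class have already been produced by some tree-fitting step, which is not shown; the function names, the data layout (`candidates[b][i][k]`), and the toy numbers are all assumptions for illustration.

```python
import math

def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    total = sum(e)
    return [x / total for x in e]

def complete_with_base(G_i, b):
    """Set the base-class entry so that the K values of one sample sum to zero."""
    out = list(G_i)
    out[b] = -sum(g for k, g in enumerate(G_i) if k != b)
    return out

def training_loss(G, y):
    """Negative log-likelihood, Eq. (3), for function values G and labels y."""
    return -sum(math.log(softmax(G_i)[y_i]) for G_i, y_i in zip(G, y))

def select_base_class(candidates, y):
    """candidates[b][i][k] holds the candidate value for sample i and class k when
    class b is the trial base; entry k = b is ignored and recomputed."""
    losses = []
    for b, G in enumerate(candidates):
        completed = [complete_with_base(G_i, b) for G_i in G]
        losses.append(training_loss(completed, y))
    best = min(range(len(losses)), key=losses.__getitem__)
    return best, losses

# Toy case: K = 3 classes, N = 2 samples, both labeled 0 (made-up values).
y = [0, 0]
candidates = [
    [[0.0, -1.0, -1.0], [0.0, -1.0, -1.0]],  # base 0: classes 1 and 2 pushed down
    [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],      # base 1: nothing changed
    [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],      # base 2: nothing changed
]
print(select_base_class(candidates, y))
```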

Algorithm 3 Abc-logitboost using the exhaustive search strategy for the base class, as suggested in [11].
The vector B stores the base class numbers.

1: $F_{i,k} = 0$, $p_{i,k} = \frac{1}{K}$, $k = 0$ to $K-1$, $i = 1$ to $N$
2: For $m = 1$ to $M$ Do
3:   For $b = 0$ to $K-1$ Do
4:     For $k = 0$ to $K-1$, $k \neq b$, Do
5:       $\{R_{j,k,m}\}_{j=1}^{J}$ = $J$-terminal node regression tree from $\{-(r_{i,b} - p_{i,b}) + (r_{i,k} - p_{i,k}), \mathbf{x}_i\}_{i=1}^{N}$
         with weights $p_{i,b}(1 - p_{i,b}) + p_{i,k}(1 - p_{i,k}) + 2 p_{i,b} p_{i,k}$, as in Sec. 2.4.
6:       $\beta_{j,k,m} = \frac{\sum_{\mathbf{x}_i \in R_{j,k,m}} -(r_{i,b} - p_{i,b}) + (r_{i,k} - p_{i,k})}{\sum_{\mathbf{x}_i \in R_{j,k,m}} p_{i,b}(1 - p_{i,b}) + p_{i,k}(1 - p_{i,k}) + 2 p_{i,b} p_{i,k}}$
7:       $G_{i,k,b} = F_{i,k} + \nu \sum_{j=1}^{J} \beta_{j,k,m} 1_{\mathbf{x}_i \in R_{j,k,m}}$
8:     End
9:     $G_{i,b,b} = -\sum_{k \neq b} G_{i,k,b}$
10:    $q_{i,k} = \exp(G_{i,k,b}) / \sum_{s=0}^{K-1} \exp(G_{i,s,b})$
11:    $L^{(b)} = -\sum_{i=1}^{N} \sum_{k=0}^{K-1} r_{i,k} \log(q_{i,k})$
12:  End
13:  $B(m) = \arg\min_b L^{(b)}$
14:  $F_{i,k} = G_{i,k,B(m)}$
15:  $p_{i,k} = \exp(F_{i,k}) / \sum_{s=0}^{K-1} \exp(F_{i,s})$
16: End

Alg. 3 presents abc-logitboost, using the derivatives in (5) and the same exhaustive search strategy
as in abc-mart. Again, abc-logitboost differs from abc-mart only in the tree-splitting procedure (Line
5).

2.6 Main Parameters



Alg. 2 and Alg. 3 have three parameters (J, ν, and M), to which the performance is in general not very
sensitive, as long as they fall in some reasonable range. This is a significant advantage in practice.

The number of terminal nodes, J, determines the capacity of the base learner. [6] suggested J = 6.
[7] commented that J > 10 is unlikely. In our experience, for large datasets (or moderate datasets
in high dimensions), J = 20 is often a reasonable choice; also see [14] for more examples.

The shrinkage, ν, should be large enough to make sufficient progress at each step and small enough
to avoid over-fitting. [6] suggested ν ≤ 0.1. Normally, ν = 0.1 is used.

The number of boosting iterations, M, is largely determined by the affordable computing time. A
commonly regarded merit of boosting is that, on many datasets, over-fitting can be largely avoided for
reasonable J and ν.

3 Datasets 

Table 1 lists the datasets used in our study. Earlier papers (e.g., [13]) provided experiments on several
other (small) datasets.

Table 1: Datasets

dataset          K     # training    # test     # features
Covertype290k    7     290506        290506     54
Covertype145k    7     145253        290506     54
Poker525k        10    525010        500000     25
Poker275k        10    275010        500000     25
Poker150k        10    150010        500000     25
Poker100k        10    100010        500000     25
Poker25kT1       10    25010         500000     25
Poker25kT2       10    25010         500000     25
Mnist10k         10    10000         60000      784
M-Basic          10    12000         50000      784
M-Rotate         10    12000         50000      784
M-Image          10    12000         50000      784
M-Rand           10    12000         50000      784
M-RotImg         10    12000         50000      784
M-Noise1         10    10000         2000       784
M-Noise2         10    10000         2000       784
M-Noise3         10    10000         2000       784
M-Noise4         10    10000         2000       784
M-Noise5         10    10000         2000       784
M-Noise6         10    10000         2000       784
Letter15k        26    15000         5000       16
Letter4k         26    4000          16000      16
Letter2k         26    2000          18000      16


3.1 Covertype 

The original UCI Covertype dataset is fairly large, with 581012 samples. To generate Covertype290k,
we randomly split the original data into halves, one half for training and the other half for testing. For
Covertype145k, we randomly select one half from the training set of Covertype290k and still keep the
same test set.

3.2 Poker 

The UCI Poker dataset originally used only 25010 samples for training and 1000000 samples for testing.
Since the test set is very large, we randomly divide it equally into two parts (I and II). Poker25kT1 uses
the original training set for training and Part I of the original test set for testing. Poker25kT2 uses the
original training set for training and Part II of the original test set for testing. This way, Poker25kT1
can use the test set of Poker25kT2 for validation, and Poker25kT2 can use the test set of Poker25kT1 for
validation. As the two test sets are still very large, this treatment will provide reliable results.

Since the original training set (about 25k) is too small compared to the size of the test set, we enlarge
the training set to form Poker525k, Poker275k, Poker150k, and Poker100k. All four enlarged training
datasets use the same test set as Poker25kT2 (i.e., Part II of the original test set). The training set of
Poker525k contains the original (25010) training set plus Part I of the original test set. Similarly, the
training set of Poker275k / Poker150k / Poker100k contains the original training set plus 250k / 125k / 75k
samples from Part I of the original test set.

The original Poker dataset provides 10 features, 5 "suit" features and 5 "rank" features. While
the "ranks" are naturally ordinal, it appears reasonable to treat "suits" as nominal features. By private
communications, R. Cattral, the donor of the Poker data, suggested that we treat the "suits" as nominal.
C.J. Lin also kindly told us that the performance of SVM was not affected by whether "suits" are treated
as nominal or ordinal. In our experiments, we choose to treat "suits" as nominal features; hence the total
number of features becomes 25 after expanding each "suit" feature into 4 binary features.
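The expansion from 10 to 25 features can be sketched as follows. This is a hypothetical illustration rather than the preprocessing code used in the experiments: it assumes each sample is stored as alternating suit and rank values for the five cards, with suits coded 1 to 4, and the function name is made up.

```python
def expand_poker_sample(sample, num_suits=4):
    """Turn [S1, C1, ..., S5, C5] into 25 features: for each of the 5 cards,
    4 binary suit indicators followed by the (ordinal) rank."""
    features = []
    for card in range(5):
        suit, rank = sample[2 * card], sample[2 * card + 1]
        features.extend(1 if suit == s else 0 for s in range(1, num_suits + 1))
        features.append(rank)
    return features

# One hand with made-up suits and ranks (assumed layout: suit, rank, suit, rank, ...).
hand = [1, 10, 2, 11, 3, 12, 4, 13, 1, 1]
expanded = expand_poker_sample(hand)
print(len(expanded), expanded)
```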

3.3 Mnist 

While the original Mnist dataset is extremely popular, this dataset is known to be too easy [9]. Originally,
Mnist used 60000 samples for training and 10000 samples for testing.

Mnist10k uses the original (10000) test set for training and the original (60000) training set for
testing. This creates a more challenging task.

3.4 Mnist with Many Variations 

[9] (www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/DeepVsShallowComparisonICML2007) created
a variety of much more difficult datasets by adding various background (correlated) noise, background
images, rotations, etc., to the original Mnist dataset. We shortened the names of the generated datasets
to M-Basic, M-Rotate, M-Image, M-Rand, M-RotImg, and M-Noise1, M-Noise2 to M-Noise6.

By private communications with D. Erhan, one of the authors of [9], we learn that the sizes of
the training sets actually vary depending on the learning algorithms. For some methods such as SVM,
they retrained the algorithms using all 12000 training samples after choosing the best parameters; and
for other methods, they used 10000 samples for training. In our experiments, we use 12000 training
samples for M-Basic, M-Rotate, M-Image, M-Rand and M-RotImg; and we use 10000 training samples
for M-Noise1 to M-Noise6.

Note that the datasets M-Noise1 to M-Noise6 have merely 2000 test samples each. By private
communications with D. Erhan, we understand this was because [9] did not mean to compare the statistical
significance of the test errors for those six datasets.

3.5 Letter

The UCI Letter dataset has in total 20000 samples. In our experiments, Letter4k (Letter2k) uses the last
4000 (2000) samples for training and the rest for testing. The purpose is to demonstrate the performance
of the algorithms using only small training sets.

We also include Letter15k, which is one of the standard partitions of the Letter dataset, by using
15000 samples for training and 5000 samples for testing.

4 Summary of Experiment Results 

We simply use logitboost (or even logit in the plots) to denote robust logitboost.

Table 2 summarizes the test mis-classification errors. For all datasets except Poker25kT1 and
Poker25kT2, we report the test errors with the tree size J = 20 and shrinkage ν = 0.1. For Poker25kT1
and Poker25kT2, we use J = 6 and ν = 0.1. We report more detailed experiment results in Sec. 5.

For Covertype290k, Poker525k, Poker275k, Poker150k, and Poker100k, as they are fairly large, we
only train M = 5000 boosting iterations. For all other datasets, we always train M = 10000 iterations
or terminate when the training loss (3) is close to the machine accuracy. Since we do not notice obvious
over-fitting on those datasets, we simply report the test errors at the last iterations.



Table 2: Summary of test mis-classification errors.

Dataset          mart     abc-mart   logitboost   abc-logitboost   # test
Covertype290k    11350    10454      10765        9727             290506
Covertype145k    15767    14665      14928        13986            290506
Poker525k        7061     2424       2704         1736             500000
Poker275k        15404    3679       6533         2727             500000
Poker150k        22289    12340      16163        5104             500000
Poker100k        27871    21293      25715        13707            500000
Poker25kT1       43575    34879      46789        37345            500000
Poker25kT2       42935    34326      46600        36731            500000
Mnist10k         2815     2440       2381         2102             60000
M-Basic          2058     1843       1723         1602             50000
M-Rotate         7674     6634       6813         5959             50000
M-Image          5821     4727       4703         4268             50000
M-Rand           6577     5300       5020         4725             50000
M-RotImg         24912    23072      22962        22343            50000
M-Noise1         305      245        267          234              2000
M-Noise2         325      262        270          237              2000
M-Noise3         310      264        277          238              2000
M-Noise4         308      243        256          238              2000
M-Noise5         294      244        242          227              2000
M-Noise6         279      224        226          201              2000
Letter15k        155      125        139          109              5000
Letter4k         1370     1149       1252         1055             16000
Letter2k         2482     2220       2309         2034             18000

4.1 P-Values

Table 3 summarizes the following four types of P-values:

• P1: for testing if abc-mart has significantly lower error rates than mart.

• P2: for testing if (robust) logitboost has significantly lower error rates than mart.

• P3: for testing if abc-logitboost has significantly lower error rates than abc-mart.

• P4: for testing if abc-logitboost has significantly lower error rates than (robust) logitboost.

The P-values are computed using binomial distributions and normal approximations. Recall that, if a
random variable $z \sim Binomial(n, p)$, then the probability parameter $p$ can be estimated by $\hat{p} = \frac{z}{n}$, and
the variance of $\hat{p}$ can be estimated by $\hat{p}(1 - \hat{p})/n$. The P-values can then be computed using the normal
approximation of binomial distributions.

Note that the test sets for M-Noise1 to M-Noise6 are very small because [9] originally did not intend
to compare the statistical significance on those six datasets. We compute their P-values anyway.
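One way to carry out such a computation is sketched below. The text does not spell out the exact test statistic, so this is an assumption: a one-sided two-sample z-test in which the variances $\hat{p}(1 - \hat{p})/n$ of the two error rates are added. The function name is hypothetical. Applied to the M-Noise1 error counts of mart (305) and abc-mart (245) out of 2000 test samples, it gives a value close to the 0.0029 listed for P1 in Table 3.

```python
import math

def p_value_lower_error(errors_a, errors_b, n_test):
    """One-sided P-value for 'method B has a lower error rate than method A',
    using the normal approximation and an (assumed) unpooled variance."""
    p_a = errors_a / n_test
    p_b = errors_b / n_test
    var = (p_a * (1.0 - p_a) + p_b * (1.0 - p_b)) / n_test
    z = (p_a - p_b) / math.sqrt(var)
    return 0.5 * math.erfc(z / math.sqrt(2.0))   # upper tail of the standard normal

# mart vs. abc-mart on M-Noise1 (error counts from Table 2, 2000 test samples).
print(p_value_lower_error(305, 245, 2000))
# mart vs. abc-mart on Covertype290k (290506 test samples).
print(p_value_lower_error(11350, 10454, 290506))
```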



Table 3: Summary of test P-values.

Dataset          P1            P2            P3            P4
Covertype290k    3 × 10^-10    3 × 10^-5     9 × 10^-8     8 × 10^-14
Covertype145k    4 × 10^-11    4 × 10^-7     2 × 10^-5     7 × 10^-9
Poker525k
Poker275k
Poker150k
Poker100k
Poker25kT1
Poker25kT2
Mnist10k         5 × 10^-8     3 × 10^-10    1 × 10^-7     1 × 10^-5
M-Basic          2 × 10^-4     1 × 10^-8     1 × 10^-5     0.0164
M-Rotate                       5 × 10^-15    6 × 10^-11    3 × 10^-16
M-Image                                      2 × 10^-7     7 × 10^-7
M-Rand                                       7 × 10^-10    8 × 10^-4
M-RotImg                                     2 × 10^-6     4 × 10^-5
M-Noise1         0.0029        0.0430        0.2961        0.0574
M-Noise2         0.0024        0.0072        0.1158        0.0583
M-Noise3         0.0190        0.0701        0.1073        0.0327
M-Noise4         0.0014        0.0090        0.4040        0.1935
M-Noise5         0.0102        0.0079        0.2021        0.2305
M-Noise6         0.0043        0.0058        0.1189        0.1002
Letter15k        0.0345        0.1718        0.1449        0.0268
Letter4k         2 × 10^-6     0.008         0.019         1 × 10^-5
Letter2k         2 × 10^-5     0.003         0.001         4 × 10^-6


The results demonstrate that abc-logitboost and abc-mart considerably outperform logitboost and
mart, respectively. In addition, except for Poker25kT1 and Poker25kT2, we observe that abc-logitboost
outperforms abc-mart, and logitboost outperforms mart.

4.2 Comparisons with SVM and Deep Learning

For UCI Poker, we know that SVM could only achieve an error rate of about 40% (by private
communications with C.J. Lin). In comparison, all four algorithms, mart, abc-mart, (robust) logitboost, and
abc-logitboost, could achieve much smaller error rates (i.e., < 10%) on Poker25kT1 and Poker25kT2.

Figure 1 provides the comparisons on the six (correlated) noise datasets: M-Noise1 to M-Noise6.
Table 4 compares the error rates on M-Basic, M-Rotate, M-Image, M-Rand, and M-RotImg.




Figure 1: Six datasets: M-Noise1 to M-Noise6. Left panel: Error rates of SVM and deep learning [9].
Middle and right panels: Error rates of four boosting algorithms. X-axis: degree of correlation from
high to low; the values 1 to 6 correspond to the datasets M-Noise1 to M-Noise6.



Table 4: Summary of error rates of various algorithms on the modified Mnist datasets [9].

                  M-Basic    M-Rotate    M-Image    M-Rand     M-RotImg
SVM-RBF           3.05%      11.11%      22.61%     14.58%     55.18%
SVM-POLY          3.69%      15.42%      24.01%     16.62%     56.41%
NNET              4.69%      18.11%      27.41%     20.04%     62.16%
DBN-3             3.11%      10.30%      16.31%     6.73%      47.39%
SAA-3             3.46%      10.30%      23.00%     11.28%     51.93%
DBN-1             3.94%      14.69%      16.15%     9.80%      52.21%
mart              4.12%      15.35%      11.64%     13.15%     49.82%
abc-mart          3.69%      13.27%      9.45%      10.60%     46.14%
logitboost        3.45%      13.63%      9.41%      10.04%     45.92%
abc-logitboost    3.20%      11.92%      8.54%      9.45%      44.69%

4.3 Performance vs. Boosting Iterations



Figure 2 presents the training loss, i.e., Eq. (3), on Covertype290k and Poker525k, for all boosting
iterations. Figures 3 and 4 provide the test mis-classification errors on Covertype, Poker, Mnist10k, and
Letter.

Figure 2: Training loss, Eq. (3), on Covertype290k and Poker525k.

Figure 3: Test mis-classification errors on Mnist10k, Letter15k, Letter4k, and Letter2k.

Figure 4: Test mis-classification errors on Covertype and Poker.

5 More Detailed Experiment Results



Ideally, we would like to demonstrate that, with any reasonable choice of parameters J and ν, abc-mart
and abc-logitboost will always improve mart and logitboost, respectively. This is indeed the
case on the datasets we have experimented with. In this section, we provide the detailed experiment results
on Mnist10k, Poker25kT1, Poker25kT2, Letter4k, and Letter2k.

5.1 Detailed Experiment Results on Mnist10k

For this dataset, we experiment with every combination of $J \in \{4, 6, 8, 10, 12, 14, 16, 18, 20, 24, 30, 40, 50\}$
and $\nu \in \{0.04, 0.06, 0.08, 0.1\}$. We train the four boosting algorithms until the training loss (3) is close
to the machine accuracy, up to M = 10000 iterations, to exhaust the capacity of the learner so that we
can provide a reliable comparison.

Table 5 presents the test mis-classification errors and Table 6 presents the P-values. Figures 5, 6,
and 7 provide the test mis-classification errors for all boosting iterations.

Table 5: Mnist10k. Upper table: The test mis-classification errors of mart and abc-mart (bold numbers).
Bottom table: The test mis-classification errors of logitboost and abc-logitboost (bold numbers).

Table 6: Mnist10k. P-values. See Sec. 4.1 for the definitions of P1, P2, P3, and P4.



P1
          ν = 0.04      ν = 0.06      ν = 0.08      ν = 0.1
J = 4     7 × 10^-?     3 × 10^-5     7 × 10^-10    1 × 10^-12
J = 6     8 × 10^-?     1 × 10^-10    9 × 10^-11
J = 8     9 × 10^-12    4 × 10^-12    5 × 10^-?     2 × 10^-10
J = 10    4 × 10^-?     2 × 10^-10    4 × 10^-11    3 × 10^-11
J = 12    1 × 10^-9     7 × 10^-11    1 × 10^-10    3 × 10^-9
J = 14    6 × 10^-?     1 × 10^-?     6 × 10^-9     9 × 10^-10
J = 16    2 × 10^-9     3 × 10^-10    6 × 10^-9     5 × 10^-9
J = 18    3 × 10^-?     2 × 10^-9     6 × 10^-10    9 × 10^-9
J = 20    2 × 10^-?     3 × 10^-?     2 × 10^-8     6 × 10^-8
J = 24    2 × 10^-8     1 × 10^-?     6 × 10^-8     2 × 10^-?
J = 30    1 × 10^-?     5 × 10^-?     2 × 10^-?     2 × 10^-?
J = 40    3 × 10^-?     1 × 10^-?     2 × 10^-?     5 × 10^-5
J = 50    6 × 10^-?     1 × 10^-?     3 × 10^-?     3 × 10^-5

P2
          ν = 0.04      ν = 0.06      ν = 0.08      ν = 0.1
J = 4     2 × 10^-?     2 × 10^-?     6 × 10^-?     3 × 10^-?
J = 6     1 × 10^-1?    4 × 10^-?     9 × 10^-9     8 × 10^-12
J = 8     4 × 10^-10    2 × 10^-9     1 × 10^-10    1 × 10^-9
J = 10    7 × 10^-11    4 × 10^-10    3 × 10^-11    2 × 10^-11
J = 12    1 × 10^-10    2 × 10^-10    2 × 10^-11    3 × 10^-10
J = 14    2 × 10^-11    8 × 10^-12    2 × 10^-10    3 × 10^-11
J = 16    1 × 10^-11    8 × 10^-11    7 × 10^-12    3 × 10^-11
J = 18    5 × 10^-11    9 × 10^-12    6 × 10^-12    9 × 10^-12
J = 20    2 × 10^-10    2 × 10^-9     1 × 10^-9     4 × 10^-10
J = 24    1 × 10^-?     3 × 10^-9     3 × 10^-?     1 × 10^-?
J = 30    2 × 10^-?     2 × 10^-?     5 × 10^-9     2 × 10^-?
J = 40    3 × 10^-5     1 × 10^-5     4 × 10^-?     2 × 10^-?
J = 50    0.0026        0.0023        3 × 10^-?     0.0013

P3
          ν = 0.04      ν = 0.06      ν = 0.08      ν = 0.1
J = 4     3 × 10^-?     5 × 10^-9     4 × 10^-?     7 × 10^-?
J = 6     4 × 10^-?     2 × 10^-?     2 × 10^-10    3 × 10^-8
J = 8     2 × 10^-9     3 × 10^-10    3 × 10^-10    6 × 10^-11
J = 10    1 × 10^-10    8 × 10^-10    6 × 10^-11    4 × 10^-10
J = 12    2 × 10^-10    2 × 10^-?     1 × 10^-9     1 × 10^-9
J = 14    5 × 10^-10    6 × 10^-9     4 × 10^-10    4 × 10^-10
J = 16    2 × 10^-?     2 × 10^-?     1 × 10^-?     1 × 10^-8
J = 18    4 × 10^-9     8 × 10^-9     6 × 10^-8     3 × 10^-8
J = 20    1 × 10^-?     2 × 10^-?     6 × 10^-8     2 × 10^-7
J = 24    2 × 10^-5     9 × 10^-?     3 × 10^-?     9 × 10^-?
J = 30    5 × 10^-?     0.0011        1 × 10^-?     2 × 10^-5
J = 40    0.0056        0.0103        0.0024        1 × 10^-?
J = 50    0.0145        0.0707        0.0218        0.0102

P4
          ν = 0.04      ν = 0.06      ν = 0.08      ν = 0.1
J = 4     1 × 10^-5     2 × 10^-?     4 × 10^-1?    5 × 10^-12
J = 6     5 × 10^-11    7 × 10^-11    1 × 10^-12    6 × 10^-1?
J = 8     4 × 10^-11    5 × 10^-13    2 × 10^-12    8 × 10^-12
J = 10    6 × 10^-11    5 × 10^-10    8 × 10^-11    7 × 10^-10
J = 12    2 × 10^-9     6 × 10^-9     6 × 10^-9     1 × 10^-8
J = 14    1 × 10^-?     4 × 10^-?     1 × 10^-8     9 × 10^-9
J = 16    1 × 10^-?     5 × 10^-?     3 × 10^-?     9 × 10^-?
J = 18    1 × 10^-?     8 × 10^-?     2 × 10^-?     8 × 10^-?
J = 20    4 × 10^-5     2 × 10^-?     8 × 10^-?     1 × 10^-5
J = 24    3 × 10^-5     3 × 10^-5     7 × 10^-?     1 × 10^-5
J = 30    3 × 10^-?     0.0016        0.0012        2 × 10^-5
J = 40    2 × 10^-?     5 × 10^-?     6 × 10^-5     3 × 10^-5
J = 50    9 × 10^-5     7 × 10^-5     2 × 10^-?     4 × 10^-?



Figure 5: Mnist10k. Test mis-classification errors of four algorithms. J = 4, 6, 8, 10.

Figure 6: Mnist10k. Test mis-classification errors of four algorithms. J = 12, 14, 16, 18.

Figure 7: Mnist10k. Test mis-classification errors of four algorithms. J = 20, 24, 30, 40, 50.

The experiment results illustrate that the performances of all four algorithms are stable over a wide
range of base-learner tree sizes J, e.g., $J \in [6, 30]$. The shrinkage parameter ν does not much affect the
test performance, although smaller ν values result in more boosting iterations (before the training losses
reach the machine accuracy).

We further randomly divide the test set of Mnist10k (60000 test samples) equally into two parts (I
and II). We then test the algorithms on Part I (using the same training results). We name this "new" dataset
Mnist10kT1. The purpose of this experiment is to further demonstrate the stability of the algorithms.

Table 7 presents the test mis-classification errors of Mnist10kT1. Compared to Table 5, the
mis-classification errors of Mnist10kT1 are roughly 50% of the mis-classification errors of Mnist10k for all J
and ν. This helps establish that our experiment results on Mnist10k provide a very reliable comparison.
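
The split-and-recount check described above can be sketched as follows. This is only an illustration under stated assumptions: the labels and predictions are synthetic toy data (not the paper's models or results), and the helper names `split_test_indices` and `count_errors` are hypothetical. It shows why, for a random equal split, the error count on Part I is expected to be close to half of the error count on the full test set.

```python
import random

def split_test_indices(n_test, seed=0):
    """Randomly split range(n_test) into two equal-sized parts (I and II)."""
    indices = list(range(n_test))
    random.Random(seed).shuffle(indices)
    half = n_test // 2
    return indices[:half], indices[half:]

def count_errors(predictions, labels, indices):
    """Number of mis-classified samples among the given test indices."""
    return sum(1 for i in indices if predictions[i] != labels[i])

# Hypothetical toy data (an assumption, not from the paper):
# 60000 test labels and predictions that are wrong about 5% of the time.
rng = random.Random(1)
n_test = 60000
labels = [rng.randrange(10) for _ in range(n_test)]
predictions = [y if rng.random() > 0.05 else (y + 1) % 10 for y in labels]

part_one, part_two = split_test_indices(n_test)
errors_full = count_errors(predictions, labels, range(n_test))
errors_part_one = count_errors(predictions, labels, part_one)
ratio = errors_part_one / errors_full
print(errors_full, errors_part_one, round(ratio, 3))
```

A ratio far from 0.5 on a random equal split would suggest that the errors are concentrated in one part of the test set, which is the kind of instability this check is meant to rule out.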

Table 7: Mnist10kT1. Upper table: The test mis-classification errors of mart and abc-mart (bold numbers).
Bottom table: The test mis-classification errors of logitboost and abc-logitboost (bold numbers).
Mnist10kT1 only uses a half of the test data of Mnist10k.









mart (first number) and abc-mart (second number)

           ν = 0.04      ν = 0.06      ν = 0.08      ν = 0.1
J = 4      1682 1514     1668 1505     1666 1416     1663 1380
J = 6      1573 1382     1523 1320     1533 1329     1582 1288
J = 8      1501 1263     1515 1257     1523 1250     1491 1279
J = 10     1492 1270     1457 1248     1470 1239     1459 1236
J = 12     1432 1244     1427 1234     1444 1228     1436 1227
J = 14     1424 1237     1420 1231     1407 1223     1419 1212
J = 16     1430 1226     1426 1224     1411 1223     1418 1204
J = 18     1400 1222     1413 1218     1390 1210     1404 1211
J = 20     1398 1213     1381 1205     1388 1213     1382 1198
J = 24     1402 1221     1366 1201     1372 1199     1346 1205
J = 30     1384 1211     1374 1208     1368 1224     1366 1205
J = 40     1397 1244     1375 1220     1397 1222     1365 1246
J = 50     1371 1239     1380 1221     1382 1223     1362 1242

logitboost (first number) and abc-logit (second number)

           ν = 0.04      ν = 0.06      ν = 0.08      ν = 0.1
J = 4      1419 1299     1449 1281     1446 1251     1460 1244
J = 6      1313 1111     1313 1114     1326 1101     1317 1097
J = 8      1278 1058     1287 1050     1270 1036     1262 1058
J = 10     1252 1061     1244 1057     1237 1040     1229 1041
J = 12     1224 1020     1219 1049     1217 1053     1224 1047
J = 14     1213 1038     1207 1050     1201 1039     1198 1026
J = 16     1185 1050     1205 1058     1189 1044     1178 1041
J = 18     1186 1048     1184 1038     1184 1046     1167 1056
J = 20     1185 1077     1199 1063     1183 1042     1184 1045
J = 24     1208 1095     1196 1083     1191 1064     1194 1068
J = 30     1225 1113     1201 1117     1190 1113     1211 1087
J = 40     1254 1159     1247 1145     1248 1127     1249 1127
J = 50     1292 1177     1284 1174     1275 1161     1276 1176

5.2 Detailed Experiment Results on Poker25kT1 and Poker25kT2



Recall that the original UCI Poker dataset used 25010 samples for training and 1000000 samples for testing.
To provide a reliable comparison (and validation), we form two datasets, Poker25kT1 and Poker25kT2, by
equally dividing the original test set into two parts (I and II). Both use the same training set. Poker25kT1
uses Part I of the original test set for testing and Poker25kT2 uses Part II for testing.

Table 8 and Table 9 present the test mis-classification errors, for J ∈ {4, 6, 8, 10, 12, 14, 16, 18, 20}
and ν ∈ {0.04, 0.06, 0.08, 0.1}. Comparing these two tables, we can see that the corresponding entries are
very close to each other, which again verifies that the four boosting algorithms provide reliable results
on this dataset.
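
As a rough illustration of "very close to each other", the sketch below compares the J = 6 rows of Tables 8 and 9 for mart and abc-mart. The numbers are copied from those tables; the relative-gap measure and the function name `relative_gaps` are choices made here for illustration, not something defined in the paper.

```python
# Test mis-classification errors for J = 6, copied from Tables 8 and 9.
# List positions correspond to shrinkage values 0.04, 0.06, 0.08, 0.1.
SHRINKAGES = [0.04, 0.06, 0.08, 0.1]
poker_t1 = {
    "mart": [71628, 59046, 48064, 43573],
    "abc-mart": [38017, 36839, 35467, 34879],
}
poker_t2 = {
    "mart": [71004, 58487, 47564, 42935],
    "abc-mart": [37567, 36345, 34920, 34326],
}

def relative_gaps(errors_a, errors_b):
    """Entry-wise |a - b| / a between two rows of error counts."""
    return [abs(a - b) / a for a, b in zip(errors_a, errors_b)]

gaps = {name: relative_gaps(poker_t1[name], poker_t2[name]) for name in poker_t1}
max_gap = max(max(values) for values in gaps.values())

for name, values in gaps.items():
    for nu, gap in zip(SHRINKAGES, values):
        print(f"{name:9s} nu={nu:<4} relative gap = {gap:.4f}")
print(f"largest relative gap: {max_gap:.4f}")
```

For these two rows the largest relative gap is below 2%, which is small compared with the differences between algorithms in the same tables.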

For most J and ν, all four algorithms achieve error rates < 10%. For both Poker25kT1 and
Poker25kT2, the lowest test errors are attained at ν = 0.1 and J = 6. Unlike Mnist10k, the test errors,
especially using mart and logitboost, are slightly sensitive to the parameters.
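
To relate the error counts in the tables to the error rates quoted above, the following sketch converts one column of Table 8 (abc-mart at ν = 0.1) into rates. It assumes each half of the original test set contains 1000000 / 2 = 500000 samples, as implied by the equal split; the variable names are illustrative only.

```python
# Each of Poker25kT1 and Poker25kT2 uses half of the 1000000 original test samples.
HALF_TEST_SIZE = 1_000_000 // 2

# abc-mart test mis-classification errors at shrinkage 0.1, copied from Table 8
# (keys are the tree sizes J).
abc_mart_errors_nu_01 = {
    4: 42126, 6: 34879, 8: 35777, 10: 36647, 12: 37345,
    14: 37780, 16: 40050, 18: 43040, 20: 44295,
}

# Convert counts to rates and locate the tree size with the lowest rate.
error_rates = {j: errors / HALF_TEST_SIZE for j, errors in abc_mart_errors_nu_01.items()}
best_j = min(error_rates, key=error_rates.get)

for j, rate in error_rates.items():
    print(f"J = {j:2d}: error rate = {100 * rate:.2f}%")
print(f"lowest error rate at J = {best_j}")
```

In this column every rate is below 10%, and the minimum is at J = 6, consistent with the statement in the text.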

Note that when J = 4 (and ν is small), training for only M = 10000 steps is not sufficient.

Table 8: Poker25kT1. Upper table: The test mis-classification errors of mart and abc-mart (bold numbers).
Bottom table: The test mis-classification errors of logitboost and abc-logitboost (bold numbers).







mart (first number) and abc-mart (second number)

           ν = 0.04          ν = 0.06          ν = 0.08          ν = 0.1
J = 4      145880 90323      132526 67417      124283 49403      113985 42126
J = 6      71628 38017       59046 36839       48064 35467       43573 34879
J = 8      64090 39220       53400 37112       47360 36407       44131 35777
J = 10     60456 39661       52464 38547       47203 36990       46351 36647
J = 12     61452 41362       52697 39221       46822 37723       46965 37345
J = 14     58348 42764       56047 40993       50476 40155       47935 37780
J = 16     63518 44386       55418 43360       50612 41952       49179 40050
J = 18     64426 46463       55708 45607       54033 45838       52113 43040
J = 20     65528 49577       59236 47901       56384 45725       53506 44295

logitboost (first number) and abc-logit (second number)

           ν = 0.04          ν = 0.06          ν = 0.08          ν = 0.1
J = 4      147064 102905     140068 71450      128161 51226      117085 42140
J = 6      81566 43156       59324 39164       51526 37954       48516 37546
J = 8      68278 46076       56922 40162       52532 38422       46789 37345
J = 10     63796 44830       55834 40754       53262 40486       47118 38141
J = 12     66732 48412       56867 44886       51248 42100       47485 39798
J = 14     64263 52479       55614 48093       51735 44688       47806 43048
J = 16     67092 53363       58019 51308       53746 47831       51267 46968
J = 18     69104 57147       56514 55468       55290 50292       51871 47986
J = 20     68899 62345       61314 57677       56648 53696       51608 49864

Table 9: Poker25kT2. Upper table: The test mis-classification errors of mart and abc-mart (bold numbers).
Bottom table: The test mis-classification errors of logitboost and abc-logitboost (bold numbers).









mart (first number) and abc-mart (second number)

           ν = 0.04          ν = 0.06          ν = 0.08          ν = 0.1
J = 4      144020 89608      131243 67071      123031 48855      113232 41688
J = 6      71004 37567       58487 36345       47564 34920       42935 34326
J = 8      63452 38703       52990 36586       46914 35836       43647 35129
J = 10     60061 39078       52125 38025       46912 36455       45863 36076
J = 12     61098 40834       52296 38657       46458 37203       46698 36781
J = 14     57924 42348       55622 40363       50243 39613       47619 37243
J = 16     63213 44067       55206 42973       50322 41485       48966 39446
J = 18     64056 46050       55461 45133       53652 45308       51870 42485
J = 20     65215 49046       58911 47430       56009 45390       53213 43888

logitboost (first number) and abc-logit (second number)

           ν = 0.04          ν = 0.06          ν = 0.08          ν = 0.1
J = 4      145368 102014     138734 70886      126980 50783      116346 41551
J = 6      80782 42699       58769 38592       51202 37397       48199 36914
J = 8      68065 45737       56678 39648       52504 37935       46600 36731
J = 10     63153 44517       55419 40286       52835 40044       46913 37504
J = 12     66240 47948       56619 44602       50918 41582       47128 39378
J = 14     63763 52063       55238 47642       51526 44296       47545 42720
J = 16     66543 52937       57473 50842       53287 47578       51106 46635
J = 18     68477 56803       57070 55166       54954 49956       51603 47707
J = 20     68311 61980       61047 57383       56474 53364       51242 49506

5.3 Detailed Experiment Results on Letter4k and Letter2k

Table 10: Letter4k. Upper table: The test mis-classification errors of mart and abc-mart (bold numbers).
Bottom table: The test mis-classification errors of logitboost and abc-logitboost (bold numbers).











mart (first number) and abc-mart (second number). Entries marked -- are illegible in the source.

           ν = 0.04      ν = 0.06      ν = 0.08      ν = 0.1
J = 4      1681 1415     1660 1380     1671 1368     --   --
J = 6      1618 1320     1584 1288     1588 1266     1577 1240
J = 8      1531 1266     1522 1246     --   1184     --   --
J = 10     1499 1228     1463 1208     --   --       1470 --
J = 12     1420 1213     1434 1186     --   1170     --   --
J = 14     1410 1190     1388 1156     --   --       --   --
J = 16     1395 1167     1402 1156     --   1157     --   --
J = 18     1376 1164     1375 1139     --   --       --   --
J = 20     1386 1154     1397 1130     --   1111     --   --
J = 24     1371 1148     1348 1155     --   --       --   --
J = 30     1383 1174     1406 1174     --   1177     1404 --
J = 40     1458 1211     1455 1224     1441 --       1454 --

logitboost (first number) and abc-logit (second number)

           ν = 0.04      ν = 0.06      ν = 0.08      ν = 0.1
J = 4      1460 1296     1471 1241     1452 1202     1446 1208
J = 6      1390 1143     1394 1117     1382 1090     1374 1074
J = 8      1336 1089     1332 1080     1311 1066     1297 1046
J = 10     1289 1062     1285 1067     1380 1034     1273 1049
J = 12     1251 1058     1247 1069     1261 1044     1243 1051
J = 14     1247 1063     1233 1051     1251 1040     1244 1066
J = 16     1244 1074     1227 1068     1231 1047     1228 1046
J = 18     1243 1059     1250 1040     1234 1052     1220 1057
J = 20     1226 1084     1242 1070     1242 1058     1235 1055
J = 24     1245 1079     1234 1059     1235 1058     1215 1073
J = 30     1232 1057     1247 1085     1229 1069     1230 1065
J = 40     1246 1095     1255 1093     1230 1094     1231 1087






Table 11: Letter2k. Upper table: The test mis-classification errors of mart and abc-mart (bold numbers).
Bottom table: The test mis-classification errors of logitboost and abc-logitboost (bold numbers).









mart (first number) and abc-mart (second number)

           ν = 0.04      ν = 0.06      ν = 0.08      ν = 0.1
J = 4      2694 2512     2698 2470     2684 2419     2689 2435
J = 6      2683 2360     2664 2321     2640 2313     2629 2321
J = 8      2569 2279     2603 2289     2563 2259     2571 2251
J = 10     2534 2242     2516 2215     2504 2210     2491 2185
J = 12     2503 2202     2516 2215     2473 2198     2492 2201
J = 14     2488 2203     2467 2231     2460 2204     2460 2183
J = 16     2503 2219     2501 2219     2496 2235     2500 2205
J = 18     2494 2225     2497 2212     2472 2205     2439 2213
J = 20     2499 2199     2512 2198     2504 2188     2482 2220
J = 24     2549 2200     2549 2191     2526 2218     2538 2248
J = 30     2579 2237     2566 2232     2574 2244     2574 2285
J = 40     2641 2303     2632 2304     2606 2271     2667 2351

logitboost (first number) and abc-logit (second number)

           ν = 0.04      ν = 0.06      ν = 0.08      ν = 0.1
J = 4      2629 2347     2582 2299     2580 2256     2572 2231
J = 6      2427 2136     2450 2120     2428 2072     2429 2077
J = 8      2336 2080     2321 2049     2326 2035     2313 2037
J = 10     2316 2044     2306 2003     2314 2021     2307 2002
J = 12     2315 2024     2315 1992     2333 2018     2290 2018
J = 14     2317 2022     2305 2004     2315 2006     2292 2030
J = 16     2302 2024     2299 2004     2286 2005     2262 1999
J = 18     2298 2044     2277 2021     2301 1991     2282 2034
J = 20     2280 2049     2268 2021     2294 2024     2309 2034
J = 24     2299 2060     2326 2037     2285 2021     2267 2047
J = 30     2318 2078     2326 2057     2304 2041     2274 2045
J = 40     2281 2121     2267 2079     2294 2090     2291 2110

6 Conclusion



Classification is a fundamental task in machine learning. This paper presents extensive experiment
results of four tree-based boosting algorithms: mart, abc-mart, (robust) logitboost, and abc-logitboost,
for multi-class classification, on a variety of publicly available datasets. From the experiment results,
we can conclude the following:

1. Abc-mart considerably improves mart.

2. Abc-logitboost considerably improves (robust) logitboost.

3. (Robust) logitboost considerably improves mart on most datasets. 

4. Abc-logitboost considerably improves abc-mart on most datasets. 

5. These four boosting algorithms (especially abc-logitboost) outperform SVM on many datasets. 

6. Compared to the best deep learning methods, these four boosting algorithms (especially abc- 
logitboost) are competitive. 
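
As a small worked example of the first four conclusions, the sketch below computes relative error reductions from a single cell of Table 11 (J = 10, ν = 0.1). The error counts are copied from that table; the choice of cell and the helper name `relative_improvement` are illustrative assumptions, and one cell does not by itself establish the general claims.

```python
# Test mis-classification errors at J = 10, shrinkage 0.1, copied from Table 11.
errors = {
    "mart": 2491,
    "abc-mart": 2185,
    "logitboost": 2307,
    "abc-logitboost": 2002,
}

def relative_improvement(baseline, improved):
    """Fraction of the baseline's errors removed by the improved algorithm."""
    return (baseline - improved) / baseline

# (baseline, improved) pairs matching conclusions 1 to 4.
comparisons = [
    ("mart", "abc-mart"),
    ("logitboost", "abc-logitboost"),
    ("mart", "logitboost"),
    ("abc-mart", "abc-logitboost"),
]
improvements = {
    (base, better): relative_improvement(errors[base], errors[better])
    for base, better in comparisons
}

for (base, better), value in improvements.items():
    print(f"{better} vs {base}: {100 * value:.1f}% fewer errors")
```

In this cell, the abc variants remove roughly 12% to 13% of the errors of their non-abc counterparts, and the logitboost variants remove roughly 7% to 8% of the errors of the corresponding mart variants.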